Head Motion Generation with Synthetic Speech: A Data Driven Approach
Authors
Abstract
To have believable head movements for conversational agents (CAs), the natural coupling between speech and head movements needs to be preserved, even when the CA uses synthetic speech. To incorporate the relation between speech and head movements, studies have learned these couplings from real recordings, where speech is used to derive head movements. However, relying on recorded speech for every sentence that a virtual agent utters constrains the versatility and scalability of the interface, so most practical solutions for CAs use text-to-speech (TTS). While we can generate head motion using rule-based models, the head movements may become repetitive, spanning only a limited range of behaviors. This paper proposes strategies to leverage speech-driven models for head motion generation in cases relying on synthetic speech. The straightforward approach is to drive the speech-based models using synthetic speech, which creates a mismatch between the training and test conditions. Instead, we propose to create a parallel corpus of synthetic speech aligned with natural recordings for which we have motion capture data. We use this parallel corpus to either retrain or adapt the speech-based models with synthetic speech. Objective and subjective metrics show significant improvements of the proposed approaches over the mismatched condition.
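As a rough sketch of the adaptation step (assumptions: a small PyTorch regressor, MFCC-like features, Euler-angle head rotations, and random tensors standing in for the parallel corpus; none of this reproduces the paper's actual architecture or training details), a pretrained speech-to-head-motion model can be fine-tuned on synthetic-speech features paired with the original motion capture:

```python
import torch
import torch.nn as nn

FEAT_DIM, MOTION_DIM = 13, 3  # e.g. MFCC frames -> head rotations (roll, pitch, yaw)

class SpeechToHeadMotion(nn.Module):
    """Frame-level regressor from acoustic features to head rotation angles."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(FEAT_DIM, 64),
            nn.ReLU(),
            nn.Linear(64, MOTION_DIM),
        )

    def forward(self, x):
        return self.net(x)

model = SpeechToHeadMotion()  # assume weights already trained on natural speech
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

# Parallel corpus: acoustic features of the TTS rendering of each transcript,
# time-aligned with the motion capture of the original natural recording.
# Random tensors stand in for real data here.
synthetic_feats = torch.randn(500, FEAT_DIM)    # frames of synthetic speech
mocap_rotations = torch.randn(500, MOTION_DIM)  # aligned head rotations

for _ in range(10):  # short adaptation pass on the synthetic-speech features
    optimizer.zero_grad()
    loss = loss_fn(model(synthetic_feats), mocap_rotations)
    loss.backward()
    optimizer.step()
```

The same loop covers both proposed strategies: retraining from scratch on the synthetic-speech corpus, or, as shown, briefly adapting a model pretrained on natural speech.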
Similar resources
A Low Bit-rate Web-enabled Synthetic Head with Speech-driven Facial Animation
In this paper, an approach that animates facial expressions through speech analysis is presented. An individualized 3D head model is first generated by modifying a generic head model, where a set of MPEG-4 Facial Definition Parameters (FDPs) has been pre-defined. To animate realistic facial expressions of the 3D head model, key frames of facial expressions are calculated from motion-captured da...
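The key-frame idea can be illustrated with a minimal interpolation sketch (the parameter dimensionality and key frames below are invented placeholders, not the paper's MPEG-4 FDP set):

```python
import numpy as np

N_PARAMS = 66  # stand-in for a per-frame facial-parameter vector

neutral = np.zeros(N_PARAMS)             # neutral expression key frame
smile = 0.1 * np.random.randn(N_PARAMS)  # stand-in key frame derived from motion capture

def interpolate(key_a, key_b, n_frames):
    """Blend two expression key frames with linear weights over n_frames."""
    w = np.linspace(0.0, 1.0, n_frames)[:, None]
    return (1.0 - w) * key_a + w * key_b

frames = interpolate(neutral, smile, 25)  # 25 in-between frames
print(frames.shape)  # (25, 66)
```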
Full text
Speech-driven head motion synthesis using neural networks
This paper presents a neural network approach for speech-driven head motion synthesis, which can automatically predict a speaker’s head movement from his/her speech. Specifically, we realize speech-to-head-motion mapping by learning a multi-layer perceptron from audio-visual broadcast news data. First, we show that a generatively pre-trained neural network significantly outperforms a randomly i...
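A hedged sketch of this two-stage recipe, with a plain autoencoder standing in for the paper's generative pre-training and random tensors replacing the broadcast-news data (window size, dimensions, and optimizer settings are illustrative assumptions):

```python
import torch
import torch.nn as nn

WINDOW, FEAT, HIDDEN, MOTION = 11, 13, 128, 3
IN_DIM = WINDOW * FEAT  # stacked context window of acoustic frames

encoder = nn.Linear(IN_DIM, HIDDEN)
decoder = nn.Linear(HIDDEN, IN_DIM)

# Stage 1: unsupervised pre-training of the first layer on speech alone.
unlabeled = torch.randn(2000, IN_DIM)  # stand-in for unlabeled audio features
pre_opt = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
for _ in range(20):
    pre_opt.zero_grad()
    recon = decoder(torch.relu(encoder(unlabeled)))
    loss = nn.functional.mse_loss(recon, unlabeled)
    loss.backward()
    pre_opt.step()

# Stage 2: supervised fine-tuning of the full speech-to-head-motion MLP.
mlp = nn.Sequential(encoder, nn.ReLU(), nn.Linear(HIDDEN, MOTION))
feats = torch.randn(500, IN_DIM)  # windowed speech features
head = torch.randn(500, MOTION)   # synchronized head-motion targets
ft_opt = torch.optim.Adam(mlp.parameters(), lr=1e-4)
for _ in range(10):
    ft_opt.zero_grad()
    loss = nn.functional.mse_loss(mlp(feats), head)
    loss.backward()
    ft_opt.step()
```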
Full text
Articulatory features for speech-driven head motion synthesis
This study investigates the use of articulatory features for speech-driven head motion synthesis as opposed to prosody features such as F0 and energy that have been mainly used in the literature. In the proposed approach, multi-stream HMMs are trained jointly on the synchronous streams of speech and head motion data. Articulatory features can be regarded as an intermediate parametrisation of sp...
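As an approximation of the joint-training idea (a single Gaussian HMM over concatenated streams, via hmmlearn, rather than a true multi-stream HMM; all data, dimensions, and state counts are stand-ins), states can be decoded from the speech dimensions alone and head motion emitted from each state's motion mean:

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

SPEECH_DIM, MOTION_DIM, STATES = 13, 3, 8
rng = np.random.default_rng(0)
speech = rng.standard_normal((1000, SPEECH_DIM))  # stand-in articulatory/acoustic frames
motion = rng.standard_normal((1000, MOTION_DIM))  # stand-in head rotations

# Joint training on the concatenated, frame-synchronous streams.
joint = GaussianHMM(n_components=STATES, covariance_type="diag",
                    n_iter=25, random_state=0)
joint.fit(np.hstack([speech, motion]))

# Speech-only decoder sharing the joint model's topology and the speech
# part of its emission statistics.
dec = GaussianHMM(n_components=STATES, covariance_type="diag")
dec.startprob_ = joint.startprob_
dec.transmat_ = joint.transmat_
dec.means_ = joint.means_[:, :SPEECH_DIM]
diag_vars = joint.covars_.diagonal(axis1=1, axis2=2)  # covars_ is returned as full matrices
dec.covars_ = diag_vars[:, :SPEECH_DIM]

test_speech = rng.standard_normal((200, SPEECH_DIM))
states = dec.predict(test_speech)                # Viterbi state sequence from speech
head_motion = joint.means_[states, SPEECH_DIM:]  # per-state head-motion means
print(head_motion.shape)  # (200, 3)
```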
Full text
Associating Facial Displays with Syntactic Constituents for Generation
We present an annotated corpus of conversational facial displays designed to be used for generation. The corpus is based on a recording of a single speaker reading scripted output in the domain of the target generation system. The data in the corpus consists of the syntactic derivation tree of each sentence annotated with the full syntactic and pragmatic context, as well as the eye and eyebrow ...
Full text
Recreation of spontaneous non-verbal behavior on a synthetic agent EVA
This paper presents a novel process of transferring the human-generated communicative behavior onto an embodied conversational agent. The aim of our work is to build a high-resolution motion dictionary based on empirical analysis of non-verbal behavior performed in multi-speaker informal dialogues. The verbal and non-verbal behavior is recreated by using this motion dictionary and on pure, unpr...
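A toy illustration of the dictionary-lookup idea (labels, clip lengths, and frame rate are invented for the example): behavior labels obtained from the dialogue analysis index captured motion clips, and an utterance's non-verbal track is recreated by concatenating the clips for its label sequence.

```python
import numpy as np

FPS, MOTION_DIM = 100, 3

# Motion dictionary built from empirical analysis of multi-speaker dialogues;
# random clips stand in for motion-capture segments here.
motion_dictionary = {
    "nod": np.random.randn(int(0.4 * FPS), MOTION_DIM),
    "head_tilt": np.random.randn(int(0.6 * FPS), MOTION_DIM),
    "gaze_shift": np.random.randn(int(0.5 * FPS), MOTION_DIM),
}

def recreate(label_sequence):
    """Concatenate dictionary clips for an annotated behavior sequence."""
    return np.concatenate([motion_dictionary[lbl] for lbl in label_sequence])

trajectory = recreate(["nod", "head_tilt", "nod"])
print(trajectory.shape)  # (140, 3) at 100 fps
```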
Full text